Document Organization and Retrieval using Self Organizing Maps and Statistical Language Modeling

Authors

  • Apostolos Georgakis
  • Constantine Kotropoulos
  • Alexandros Xafopoulos
  • Ioannis Pitas
Abstract

In this paper we present a method for document organization and retrieval based on statistical language modeling. The proposed method, which is based on the vector model, uses nonlinear interpolation to provide more accurate statistical estimators of the conditional probabilities employed for encoding the context of each word. An information retrieval system is built using the self-organizing map algorithm. In the first step, the self-organizing architecture is used to cluster the feature vectors and to build clusters of semantically related words. Subsequently, the collection of documents is encoded into vectors and the same algorithm is used to cluster the documents into contextually related classes. The information retrieval system is queried using a sample document and the corresponding precision-recall curve is provided.
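The clustering step named in the abstract can be illustrated with a minimal sketch of a standard Kohonen self-organizing map. This is not the authors' configuration: the map size, the linearly decaying learning rate, the Gaussian neighbourhood, and the function names `train_som` and `best_matching_unit` are all assumptions made for illustration, and the toy vectors merely stand in for the word or document feature vectors the paper describes.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the node whose weight vector is closest to x."""
    return min(weights, key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM on a list of equal-length float vectors.

    Hypothetical helper: learning rate and neighbourhood radius both decay
    linearly, which is one common choice among many.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # One weight vector per map node, initialised with small random values.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # present the samples in random order each epoch
        for idx in order:
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Squared distance on the map grid, not in feature space.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


if __name__ == "__main__":
    # Toy feature vectors forming two well-separated groups (illustrative only).
    toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [10.0, 10.0], [9.9, 10.1], [10.1, 9.8]]
    som = train_som(toy, rows=1, cols=2)
    for vec in toy:
        print(vec, "->", best_matching_unit(som, vec))
```

In a pipeline like the one the abstract outlines, the same routine would be run twice: once on word context vectors to obtain word clusters, and once on the document vectors derived from them. A query document would then be mapped to its best-matching node to retrieve the documents clustered there.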


Similar resources

An approach based on language modeling and neural networks

This thesis covers topics relevant to information organization and retrieval. The main objective of the work is to provide algorithms that can elevate the recall-precision performance of retrieval tasks in a wide range of applications ranging from document organization and retrieval to web-document pre-fetching and finally clustering of documents based on novel encoding techniques. The first pa...


Self-organizing Maps in Natural Language Processing

Kohonen's Self-Organizing Map (SOM) is one of the most popular artificial neural network algorithms. Word category maps are SOMs that have been organized according to word similarities, measured by the similarity of the short contexts of the words. Conceptually interrelated words tend to fall into the same or neighboring map nodes. Nodes may thus be viewed as word categories. Although no a prior...


A combination of Wilcoxon test and R-estimates for document organization and retrieval

The Wilcoxon signed-rank test is exploited for document organization and retrieval in this paper. A novel modeling method for documents and a distance metric between documents are proposed. Both document modeling and document comparisons are based on signed-ranks and are applied to the frequency of occurrence of the document bigrams. A metric using the Wilcoxon signed-rank test exploits these s...


Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps

Clustering and visualization of large text document collections aids in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred usin...


Indexing Audio Documents by using Latent Semantic Analysis and SOM

This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection an...




Publication date: 2001